System and method for accessing heterogeneous databases

ABSTRACT

A system and method are provided for answering queries concerning information stored in a set of collections. Each collection includes a structured entity, and each structured entity includes a field. A query is received that specifies a subset of the set of collections and a logical constraint between fields that includes a requirement that a first field match a second field. The probability that the first field matches the second field is determined automatically based upon the contents of the fields. A collection of lists is generated in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query.

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No. 60/039,576 filed Feb. 25, 1997.

FIELD OF THE INVENTION

[0002] This invention relates to accessing databases, and particularly to accessing heterogeneous relational databases.

BACKGROUND OF THE INVENTION

[0003] Databases are the principal way in which information is stored. The most commonly used type of database is a relational database, in which information is stored in tables called relations. Relational databases are described in A First Course on Database Systems by Ullman and Widom, Prentice Hall, 1997, and in An Introduction to Database Systems, by C. J. Date, Addison Wesley, 1995.

[0004] Each entry in a relation is typically a character string or a number. Generally relations are thought of as sets of tuples, a tuple corresponding to a single row in the table. The columns of a relation are called fields.

[0005] Commonly supported operations on relations include selection and join. Selection is the extraction of tuples that meet certain conditions. Two relations are joined on fields F1 and F2 by first taking their Cartesian product (the Cartesian product of two relations A and B is the set of all tuples a₁, . . . , a_(m), b₁, . . . , b_(n), where a₁, . . . , a_(m) is a tuple from A, and b₁, . . . , b_(n) is a tuple from B) and then selecting all tuples such that F1=F2. This leads to a relation with two equivalent fields, so usually one of these is discarded.
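
The join just described can be sketched in a few lines of Python. This is only an illustration of the textbook definition (Cartesian product, then selection, then discarding the duplicate field); the function name `join`, the field positions, and the sample rows are assumptions made for the example and are not taken from FIG. 1.

```python
from itertools import product

def join(a, b, f1, f2):
    """Join relations a and b (lists of tuples) on a[f1] == b[f2]."""
    result = []
    for ta, tb in product(a, b):      # Cartesian product of the two relations
        if ta[f1] == tb[f2]:          # selection: keep tuples with F1 = F2
            # drop the now-redundant field f2 from the tuple taken from b
            result.append(ta + tb[:f2] + tb[f2 + 1:])
    return result

# Hypothetical sample rows: (time, channel, movie name, movie id) and (movie id, review)
q = [("12:30", 11, "Queen of Outer Space", 137)]
r = [(137, "review text"), (138, "another review")]
joined = join(q, r, 3, 0)
```

A real database system would not enumerate the full Cartesian product; it would typically use an index or hash on the key field, which is why joins on designated keys are efficient.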

[0006] Joining relations is the principal means of aggregating information that is spread across several relations. For example, FIG. 1 shows two sample relations Q 101 and R 102, and the result of joining Q and R (the “Join” of Q and R) 103 on the fields named MovieID (the columns indicated by 104). For reasons of efficiency, relations are usually joined on special fields that have been designated as keys, and database management systems are implemented so as to efficiently perform joins on fields that are keys.

[0007] In most databases, each tuple corresponds to an assertion about the world. For instance, the tuple <12:30, 11, “Queen of Outer Space (Zsa Zsa Gabor)”, 137> (the row indicated by 105) in the relation Q 101 of FIG. 1 corresponds to the assertion “the movie named ‘Queen of Outer Space’, starring Zsa Zsa Gabor, will be shown at 12:30 on channel 11.”

[0008] Known systems can represent information that is uncertain in a database. One known method associates every tuple in the database with a real number indicating the probability that the corresponding assertion about the world is true. For instance, the tuple described above might be associated with the probability 0.9 if the preceding program was a major sporting event, such as the World Series. The uncertainty represented in this probability includes the possibility, for example, that the World Series program may extend beyond its designated time slot. Extensions to the database operations of join and selection useful for relations with uncertain information are also known. One method for representing uncertain information in a database is described in “Probabilistic Datalog—a Logic for Powerful Retrieval Methods” by Norbert Fuhr, in Proceedings of the 1995 ACM SIGIR Conference on Research in Information Retrieval, pages 282-290, New York, 1995. Other methods are surveyed in Uncertainty Management in Information Systems, edited by Motro and Smets, Kluwer Academic Publishers, 1997. Database systems that have been extended in this way are called probabilistic databases.

[0009] Another way of storing information is with a text database. Here information is stored as a collection of documents, also known as a corpus. Each document is simply a textual document, typically in English or some other human language. One standard method for representing text in such a database so that it can be easily accessed by a computer is to represent each document as a so-called document vector. A document vector representation of a document is a vector with one component for each term appearing in the corpus. A term is typically a single word, a prefix of a word, or a phrase containing a small number of words or prefixes. The value of the component corresponding to a term is zero if that term does not appear in the document, and non-zero otherwise.

[0010] Generally the non-zero values are chosen so that words that are likely to be important have larger weights. For instance, words that occur many times in a document, or words that are rare in the corpus, have large weights. A similarity function can then be defined for document vectors, such that documents with similar term weights have high similarity, and documents with different term weights have low similarity. Such a similarity function is called a term-based similarity metric.

[0011] An operation commonly supported by such text databases is called ranked retrieval. The user enters a query, which is a textual description of the documents he or she desires to be retrieved. This query is then converted into a document vector. The database system then presents to the user a list of documents in the database, ordered (for example) by decreasing similarity to the document vector that corresponds to the query.

[0012] As an example, the Review column (the column indicated by 107) of relation R 102 in FIG. 1 might be instead stored in a text database. The answer to the user query “embarrassingly bad science fiction” might be a list containing the review of “Queen of Outer Space” as its first element, and the review of “Space Balls” as its second element.

[0013] In general, the user will only be interested in seeing a small number of the documents that are highly similar. Techniques are known for efficiently generating a reduced list of documents, say of size K, that contains all or most of the K documents that are most similar to the query vector, without generating as an intermediate result a list of all documents that have non-zero similarity to the query. Such techniques are described in Chapters 8 and 9 of Automatic Text Processing, edited by Gerard Salton, Addison Wesley, Reading, Massachusetts, 1989, and in Query Evaluation: Strategies and Optimizations by Howard Turtle and James Flood in Information Processing and Management, 31(6):831-850, November 1995.

[0014] In some relational database management systems (RDBMS) relations are stored in a distributed fashion, i.e., different relations are stored on different computers. One issue which arises in distributed databases pertains to joining relations stored at different sites. In order for this join to be performed, it is necessary for the two relations to use comparable keys. For instance, consider two relations M and E, where each tuple in M encodes a single person's medical history, and each tuple in E encodes data pertaining to a single employee of some large company. Joining these relations is feasible if M and E both use social security numbers as keys. However, if E uses some entirely different identifier (say an employee number), then the join cannot be carried out, and there is no known way of aligning the tuples in E with those in M. To take another example, the relations Q 101 and R 102 of FIG. 1 could not be joined unless they both contained a similar field, such as the MovieID field (column 104).

[0015] In practice, the presence of incomparable key fields is often a problem in merging relations that are maintained by different organizations. A collection of relations that are maintained separately is called heterogeneous. The problem of providing access to a collection of heterogeneous relations is called data integration. The process of finding pairs of keys that are likely to be equivalent is called key matching.

[0016] Techniques are known for coping with some sorts of key mismatches that arise in accessing heterogeneous databases. One technique is to normalize the keys. For instance, in the relations Q 101 and R 102 in FIG. 1, suppose that numeric MovieIDs are not available, and it is desirable to join Q 101 and R 102 on strings that contain the name of the movie, specifically, the MovieName field (the column indicated by 106) of Q 101, and the underlined section of the Review field (the column indicated by 107) of R 102. One might normalize these strings by removing all parenthesized text (which contains actors' names in Q 101, and a rating in R 102).
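
A minimal sketch of such a normalization procedure follows. The function name `normalize_key` is hypothetical, and the additional whitespace collapsing and lower-casing are assumptions added for the example; the text above only mentions removing parenthesized text.

```python
import re

def normalize_key(s):
    """Strip parenthesized text, collapse whitespace, and lower-case a key string."""
    no_parens = re.sub(r"\([^)]*\)", " ", s)   # remove "( ... )" groups (non-nested)
    return " ".join(no_parens.split()).lower()  # assumption: case is not significant

key_q = normalize_key("Queen of Outer Space (Zsa Zsa Gabor)")
key_r = normalize_key("Queen of Outer Space (1 star)")
```

After normalization the two strings compare equal and could serve as join keys. The sketch also shows why the approach is brittle: a procedure of this kind has to be written by hand for each new kind of key field.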

[0017] A data integration system based on normalization of keys is described in Querying Heterogeneous Information Sources Using Source Descriptions, by Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille, in Proceedings of the 22nd International Conference on Very Large Databases (VLDB-96), Bombay, India, September 1996.

[0018] Another known technique for handling key mismatches is to use an equality predicate, a function which, when called with arguments Key1 and Key2, indicates if Key1 and Key2 should be considered equivalent for the purpose of a join. Generally such a function is of limited applicability because it is appropriate only for a small number of pairs of columns in a specific database. The use of equality tests is described in The Identification and Resolution of Semantic Heterogeneity in Multidatabase Systems, by Douglas Fang, Joachim Hammer, and Dennis McLeod, in Multidatabase Systems: An Advanced Solution for Global Information Sharing, pages 52-60, IEEE Computer Society Press, Los Alamitos, Calif., 1994. Both normalization and equality predicates are potentially expensive in terms of human effort: for every new type of key field, a new equality predicate or normalization procedure must be written by a human programmer.

[0019] It is often the case that the keys to be matched are strings that name certain real-world entities. (In our example, for instance, they are the names of movies.) Techniques are known for examining pairs of names and assessing the probability that they refer to the same entity. Once this has been done, then a human can make a decision about what pairs of names should be considered equal for all subsequent queries that require key matching. Such techniques are described in Record Linkage Techniques—1985, edited by B. Kilss and W. Alvey, Statistics of Income Division, Internal Revenue Service Publication 1299-2-96, available from http://www.bts.gov/fcsm/methodology/, 1985, as well as in The Merge/purge Problem for Large Databases, by M. Hernandez and S. Stolfo, in Proceedings of the 1995 ACM SIGMOD, May 1995, and Heuristic Joins to Integrate Structured Heterogeneous Data, by Scott Huffman and David Steier, in Working Notes of the AAAI Spring Symposium on Information Gathering In Heterogeneous Distributed Environments, Palo Alto, California, March 1995, AAAI Press.

[0020] Many of these techniques require information about the types of objects that are being named. For instance, Soundex is often used to match surnames. An exception to this is the use of the Smith-Waterman edit distance, which provides a general similarity metric for any pair of strings. The use of the Smith-Waterman edit distance metric for key matching is described in An Efficient Domain-independent Algorithm for Detecting Approximately Duplicate Database Records by A. Monge and C. Elkan, in The Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery, May 1997.

[0021] It is also known how to use term-based similarity functions, closely related to IR similarity metrics, for key matching. Use of term-based similarity metrics for key matching, as an alternative to Smith-Waterman, is described in The Field-matching Problem: Algorithm and Applications by A. Monge and C. Elkan in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.

[0022] In summary, known methods require that data from heterogeneous sources be preprocessed in some manner. In particular, the data fields that will be used as keys must be normalized, using a domain-specific procedure, or a domain-specific equality test must be written, or a determination as to which keys are in fact matches must be made by a user, perhaps guided by some previously computed assessment of the probability that each pair of keys matches.

[0023] All of these known procedures require human intervention, potentially for each pair of data sources. Furthermore, all of these procedures are prone to error. Errors in the process of determining which keys match will lead to incorrect answers to queries to the resulting database.

[0024] What is needed is a way of accessing data from many heterogeneous sources without any preprocessing steps that must be guided by a human. Furthermore, when pairs of keys from different sources are assumed to match, the end user should be alerted to these assumptions, and provided with some estimate of the likelihood that the assumptions are correct, or other information with which the end user can assess the quality of the result.

SUMMARY OF THE INVENTION

[0025] An embodiment of the present invention accesses information stored in heterogeneous databases by using probabilistic database analysis techniques to answer database queries. The embodiment uses uncertain information about possible key matches obtained by using general-purpose similarity metrics to assess the probability that pairs of keys from different databases match. This advantageously allows a user to access heterogeneous sources of information without requiring any preprocessing steps that must be guided by a human. Furthermore, when pairs of keys from different sources are assumed to match, the user is apprised of these assumptions, and provided with some estimate of the likelihood that the assumptions are correct. This likelihood information can help the user to assess the quality of the answer to the user's query.

[0026] Data from heterogeneous databases is collected and stored in relations. In one embodiment, the data items in these relations that will be used as keys are represented as text. A query is received by a database system. This query can pertain to any subset of the relations collected from the heterogeneous databases mentioned above. The query may also specify data items from these relations that must or should refer to the same entity.

[0027] A set of answer tuples is computed by the database system. These tuples are those that are determined in accordance with the present invention to be most likely to satisfy the user's query. A tuple is viewed as likely to satisfy the query if those data items that should refer to the same entity (according to the query) are judged to have a high probability of referring to the same entity. The probability that two data items refer to the same entity is determined using problem-independent similarity metrics that advantageously do not require active human intervention to formulate for any particular problem.

[0028] In computing the join of two relations, each of size N, N² pairs of keys must be considered. Hence, for moderately large N, it is impractical to compute a similarity metric (and store the result) for each pair. An embodiment of the present invention advantageously solves this problem by computing similarities between pairs of keys at the time a query is considered, and computing similarities between only those pairs of keys that are likely to be highly similar.

[0029] In some cases, many pairs of keys will be weakly similar, and hence will have some small probability of referring to the same entity. Thus, the answer to a query could consist of a small number of tuples with a high probability of being correct answers, and a huge number of tuples with a small but non-zero probability of being correct answers. Known probabilistic database methods would disadvantageously generate all answer tuples with non-zero probability, which often would be an impractically large set. The present invention advantageously solves this problem by computing and returning to the user only a relatively small set of tuples that are most likely to be correct answers, rather than all tuples that could possibly be correct answers.

[0030] In one embodiment of the present invention, the answer tuples are returned to the user in the order of their computed likelihood of being correct answers, i.e., the tuples judged to be most likely to be correct are presented first, and the tuples judged less likely to be correct are presented later.

[0031] In accordance with one embodiment of the present invention, queries concerning information stored in a set of collections are answered. Each collection includes a structured entity. Each structured entity in turn includes a field.

[0032] In accordance with an embodiment of the present invention, a query is received that specifies a subset of the set of collections and a logical constraint between fields that includes a requirement that a first field match a second field. The probability that the first field matches the second field based upon the contents of the fields is automatically determined. A collection of lists is generated in response to the query, where each list includes members of the subset of collections specified in the query. Each list also has an estimate of the probability that the members of the list satisfy the logical constraint specified in the query.

[0033] The present invention advantageously combines probabilistic database techniques with probabilistic assessments of similarity to provide a means for automatically and efficiently accessing heterogeneous data sources without the need for human intervention in identifying similar keys.

BRIEF DESCRIPTION OF THE DRAWINGS

[0034] FIG. 1 shows a prior art example of two relations Q and R and a join of relations Q and R.

[0035] FIG. 2 shows an embodiment of a system and apparatus in accordance with the present invention.

[0036] FIG. 3 shows a table of relations upon which experiments were performed to determine properties of the present invention.

DETAILED DESCRIPTION

[0037] An embodiment of an apparatus and system in accordance with the present invention is shown in FIG. 2. A search server 201, user 202, and database server A 203, database server B 204 and database server C 205 are coupled to network 206. Heterogeneous databases U 207, V 208 and W 209 are coupled to database server A 203. Heterogeneous databases X 210 and Y 211 are coupled to database server B 204. Heterogeneous database Z 212 is coupled to database server C 205. User 202 submits a query to search server 201. Search server 201 conducts a search of heterogeneous databases U 207, V 208, W 209, X 210, Y 211 and Z 212 in an automatic fashion in accordance with the method of the present invention.

[0038] As shown in FIG. 2, search server 201 includes processor 213 and memory 214 that stores search instructions 215 adapted to be executed on processor 213. In one embodiment of the present invention, processor 213 is a general purpose microprocessor, such as the Pentium II processor manufactured by the Intel Corporation of Santa Clara, Calif. In another embodiment, processor 213 is an Application Specific Integrated Circuit (ASIC) that embodies at least part of the search instructions 215, while the rest are stored at memory 214. In various embodiments of the present invention, memory 214 is a hard disk, read-only memory (ROM), random access memory (RAM), flash memory, or any combination thereof. Memory 214 is meant to encompass any medium capable of storing digital data. As shown in FIG. 2, memory 214 is coupled to processor 213.

[0039] One embodiment of the present invention is a medium that stores search instructions. As used herein, the phrase “adapted to be executed” is meant to encompass instructions stored in a compressed and/or encrypted format, as well as instructions that have to be compiled or installed by an installer before being executed by processor 213.

[0040] In one embodiment, the search server further comprises a port 216 adapted to be coupled to a network 206. The port is coupled to memory 214 and processor 213.

[0041] In one embodiment, network 206 is the Internet. In another embodiment, it is a Local Area Network (LAN). In yet another embodiment, it is a Wide Area Network (WAN). In accordance with the present invention, network 206 is meant to encompass any switched means by which one computer communicates with another.

[0042] In one embodiment, the user is a personal computer. In one embodiment, database servers A 203, B 204 and C 205 are computers, adapted to act as interfaces between a network 206 and databases. In one embodiment the database servers 203, 204 and 205 are server computers. In another embodiment, they act as peer computers.

[0043] As discussed above, many databases contain many fields in which the individual constants correspond to entities in the real world. Examples of such name domains include course numbers, personal names, company names, movie names, and place names. In general, the mapping from name constants to real entities can differ in subtle ways from database to database, making it difficult to determine if two constants are co-referent (i.e., refer to the same entity).

[0044] For instance, in two Web databases listing educational software companies, one finds the name constants “Microsoft” and “Microsoft Kids.” Do these denote the same company, or not? In another pair of Web sources, the names “Kestrel” and “American Kestrel” appear. Likewise, it is unclear as to whether these denote the same type of bird. Other examples of this problem include “MIT” and “MIT Media Labs”; and “AT&T Bell Labs,” “AT&T Labs”, “AT&T Labs—Research,” “AT&T Research,” “Bell Labs,” and “Bell Telephone Labs.”

[0045] As can be seen from the above examples, determining if two name constants are co-referent is far from trivial in many real-world data sources. Frequently it requires detailed knowledge of the world, the purpose of the user's query, or both. These generally necessitate human intervention in preprocessing or otherwise handling a user query.

[0046] Unfortunately, answering most database queries requires understanding which names in a database are coreferent. Two phrases are coreferent if each refers to the same or approximately the same external entity. An external entity is an entity in the real world to which a phrase refers. For example, “Microsoft” and “Microsoft, Inc.” are two phrases that are coreferent in the sense that they refer to the same company. As used herein, the term “phrase” means any fragment of text down to a single character, e.g., a word, a collection of words, a letter, several letters, a number, a punctuation mark or set of punctuation marks, etc.

[0047] This requirement of understanding which names in a database are coreferent poses certain problems. For example, to join two databases on Company_name fields, where the values of the company names are Microsoft and Microsoft Kids, one must know in advance if these two names are meant to refer to the same company. This suggests extending database systems to represent the names explicitly so as to compute the probability that two names are coreferent. This in turn requires that the database includes an appropriate way of representing text (phrases).

[0048] One widely used method for representing text, briefly described above, is the vector space model. Assume a vocabulary T of terms, each of which will be treated as atomic, i.e., unbreakable. Terms can include words, phrases, or word stems, which are morphologically derived word prefixes. A fragment of text is represented as a document vector, which is a vector of real numbers v ∈ R^(|T|), each component of which corresponds to a term t ∈ T. The component of v which corresponds to t ∈ T is denoted v^(t).

[0049] A number of schemes have been proposed for assigning weights to terms, as discussed above. An embodiment of the present invention uses the TF-IDF weighting scheme with unit length normalization. Assuming that the document represented by v is a member of a document collection C, define $\hat{v}^{t}$ to have the value zero if t is not present in the document represented by v, and otherwise the value $\hat{v}^{t} = (\log(TF_{v,t}) + 1) \cdot \log(IDF_{t})$, where the “term frequency” TF_(v,t) is the number of times that term t occurs in the document represented by v, and the inverse document frequency IDF_(t) is $\frac{|C|}{|C_{t}|},$

[0050] where C_(t) is the subset of documents in C that contain the term t. This vector is then normalized to unit length, leading to the following weight for v^(t): $v^{t} = \frac{\hat{v}^{t}}{\sqrt{\sum\limits_{t' \in T}\left( \hat{v}^{t'} \right)^{2}}}$

[0051] The “similarity” of two document vectors v and w is given by the formula: sim (v, w) = $\sum\limits_{t \in T} v^{t} \cdot w^{t},$

[0052] which is usually interpreted as the cosine of the angle between v and w. Since every document vector v has unit length, sim (v, w) is always between zero and one.
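
The weighting and similarity formulas above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: documents are taken to be pre-tokenized lists of terms, vectors are stored sparsely as dictionaries, and the names `tfidf_vectors` and `sim` as well as the four sample company names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a unit-length sparse TF-IDF vector (term -> weight)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # |C_t|: number of documents containing t
    vectors = []
    for d in docs:
        tf = Counter(d)                              # TF_{v,t}
        raw = {t: (math.log(c) + 1.0) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # a document whose terms all occur in every document gets the empty vector
        vectors.append({t: w / norm for t, w in raw.items()} if norm > 0 else {})
    return vectors

def sim(v, w):
    """Inner product of two sparse unit vectors, i.e. the cosine of their angle."""
    if len(v) > len(w):
        v, w = w, v                                  # iterate over the shorter vector
    return sum(x * w.get(t, 0.0) for t, x in v.items())

names = ["microsoft inc", "microsoft kids inc", "acme inc", "lucent technologies inc"]
vecs = tfidf_vectors([name.split() for name in names])
s01 = sim(vecs[0], vecs[1])   # the two names share the informative term "microsoft"
s02 = sim(vecs[0], vecs[2])   # the two names share only "inc", which has weight zero here
```

In this toy collection “inc” occurs in every name, so log(IDF) is zero and the term contributes nothing to any similarity, which matches the intuition that very common terms carry little weight.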

[0053] Although these vectors are conceptually very long, they are also very sparse: if a document contains only k terms, then all but k components of its vector representation will have zero weight. Methods for efficiently manipulating these sparse vectors are known. The vector space representation for documents is described in Automatic Text Processing, edited by Gerard Salton, Addison Wesley, Reading, Mass., 1989.

[0054] The general idea behind this scheme is that the magnitude of the component v^(t) is related to the “importance” of the term t in the document represented by v. In accordance with the present invention, two documents are similar when they share many “important” terms. The TF-IDF weighting scheme assigns higher weights to terms that occur infrequently in the collection C. The weighting scheme also gives higher weights to terms that occur frequently in a document. However, in this context, this heuristic is probably not that important, since names are usually short enough so that each term occurs only once. In a collection of company names, for instance, common terms like “Inc.” and “Ltd.” would have low weights. Uniquely appearing terms like “Lucent” and “Microsoft” would have high weights. And terms of intermediate frequency like “Acme” and “American” would have intermediate weights.

[0055] The present invention operates on data stored in relations, where the primitive elements of each relation are document vectors, rather than atoms. This data model is called STIR, which stands for Simple Texts In Relations. The term “simple” indicates that no additional structure is assumed for the texts.

[0056] More precisely, an extensional database (EDB) consists of a term vocabulary T and a set of relations {p₁, . . . , p_(n)}. Associated with each relation p is a set of tuples called tuples(p). Every tuple (v₁, . . . , v_(k)) ∈ tuples(p) has exactly k components, and each of these components v_(i) is a document vector. It is also assumed that a score is associated with every tuple in p. This score will always be between zero and one, and will be denoted score((v₁, . . . , v_(k)) ∈ tuples(p)). In most applications, the score of every tuple in a base relation will be one; however, in certain embodiments, non-unit scores can occur. This allows materialized views to be stored.

[0057] An embodiment of a language for accessing these relations in accordance with the present invention is called WHIRL, which stands for Word-based Heterogeneous Information Retrieval Logic. A conjunctive WHIRL query is written B₁ ∧ . . . ∧ B_(k), where each B_(i) is a literal. There are two types of literals. An EDB literal is written p(X₁, . . . , X_(k)) where p is the name of an EDB relation, and the X_(i)'s are variables. A similarity literal is written X˜Y, where X and Y are variables. Intuitively, this can be interpreted as a requirement that documents X and Y be similar. If X appears in a similarity literal in a query Q, then X also appears in some EDB literal in Q.

[0058] To take another example, consider two relations R and S, where tuples of R contain a company name and a brief description of the industry associated with that company, and tuples of S contain a company name and the location of the World Wide Web homepage for that company. The join of the relations R and S might be approximated by the query:

Q₁: r(Company1,Industry) ∧ s(Company2,WebSite) ∧ Company1˜Company2

[0059] This is different from an equijoin of R and S, which could be written:

r(Company,Industry) ∧ s(Company,WebSite).

[0060] To find Web sites for companies in the telecommunications industry one might use the query:

Q₂: r(Company1,Industry) ∧ s(Company2,WebSite) ∧ Company1˜Company2 ∧ const1(IO) ∧ Industry˜IO

[0061] where the relation const1 contains a single document describing the industry of interest, such as “telecommunications equipment and/or services”.

[0062] The semantics of WHIRL are defined in part by extending the notion of score to single literals, and then to conjunctions. The semantics of WHIRL are best described in terms of substitutions. A substitution θ is a mapping from variables to document vectors. A substitution is denoted as θ={X₁=v₁, . . . , X_(n)=v_(n)}, where each X_(i) is mapped to the vector v_(i). The variables X_(i) in the substitution are said to be “bound” by θ. If Q is a WHIRL query (or a literal or variable) then Qθ denotes the result of applying that mapping to Q, i.e., the result of taking Q and replacing every variable X_(i) appearing in Q with the corresponding document vector v_(i). A substitution θ is “ground for Q” if Qθ contains no variables.

[0063] Suppose B is a literal, and θ is a substitution such that Bθ is ground. If B is an EDB literal p(X₁, . . . , X_(k)), then score(Bθ) = score((X₁θ, . . . , X_(k)θ) ∈ tuples(p)) if (X₁θ, . . . , X_(k)θ) ∈ tuples(p), and score(Bθ) = 0 otherwise. If B is a similarity literal X˜Y, then score(Bθ) = sim(Xθ, Yθ).

[0064] If Q = B₁ ∧ . . . ∧ B_(k) is a query and Qθ is ground, then define score(Qθ) = Π_(i=1)^(k) score(B_(i)θ). In other words, conjunctive queries are scored by combining the scores of literals as if they were independent probabilities.
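
The scoring rule for a ground conjunctive query can be sketched as a product over literal scores. In this simplified illustration the bound values are plain strings rather than document vectors, each relation is a dictionary from tuples to scores, and the similarity value 0.8 is simply assumed instead of computed; all names and sample data are hypothetical.

```python
from math import prod

def score_edb_literal(relation, bound_tuple):
    """Score of a ground EDB literal: the stored tuple score, or 0 if the tuple is absent."""
    return relation.get(bound_tuple, 0.0)

def score_conjunction(literal_scores):
    """Score of a ground conjunction: product of its literal scores."""
    return prod(literal_scores)

# Hypothetical base relations in which every tuple has score one
r = {("Acme Inc", "hardware"): 1.0}
s = {("ACME", "http://example.com"): 1.0}
assumed_sim = 0.8   # stand-in for sim(Company1, Company2) under some substitution

total = score_conjunction([
    score_edb_literal(r, ("Acme Inc", "hardware")),
    score_edb_literal(s, ("ACME", "http://example.com")),
    assumed_sim,
])
# A substitution that binds an EDB literal to a tuple outside the relation scores zero
missing = score_conjunction([score_edb_literal(r, ("Nope", "x")), 1.0, assumed_sim])
```

Because base tuples usually have score one, the score of a query such as Q₁ is in practice driven by its similarity literals.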

[0065] Recall that the answer to a conventional conjunctive query is the set of ground substitutions that make the query “true,” i.e., provable against the EDB. In WHIRL, the notion of provability has been replaced with the “soft” notion of score: substitutions with a high score are intended to be better answers than those with a low score. It seems reasonable to assume that users will be most interested in seeing the high-scoring substitutions, and will be less interested in the low-scoring substitutions. This is formalized as follows: Given an EDB, the “full answer set” S_(Q) for a conjunctive query Q is defined to be the set of all θ such that Qθ is ground and has a non-zero score. An r-answer R_(Q) for a conjunctive query Q is defined to be an ordered list of substitutions θ₁, . . . , θ_(r) from the full answer set such that:

[0066] for all θ_(i) ∈ R_(Q) and σ ∈ S_(Q) − R_(Q), score(Qθ_(i)) ≥ score(Qσ); and

[0067] for all θ_(i), θ_(j) ∈ R_(Q) where i<j, score(Qθ_(i)) ≥ score(Qθ_(j)).

[0068] In other words, R_(Q) contains the r highest-scoring substitutions, ordered by non-increasing score.
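
The definition of an r-answer can be illustrated with a short sketch. Note the loud assumption: this version takes an already enumerated list of scored substitutions and selects from it, which is exactly the exhaustive enumeration the invention aims to avoid; it shows only what an r-answer is, not how an embodiment computes one. The function name and the sample candidates are hypothetical.

```python
import heapq

def r_answer(scored_substitutions, r):
    """Return up to r (score, substitution) pairs with the highest non-zero scores,
    ordered by non-increasing score."""
    nonzero = [(score, theta) for theta, score in scored_substitutions if score > 0]
    return heapq.nlargest(r, nonzero, key=lambda pair: pair[0])

# Hypothetical substitutions, named by label, with their query scores
candidates = [("theta_a", 0.2), ("theta_b", 0.9), ("theta_c", 0.0), ("theta_d", 0.5)]
top2 = r_answer(candidates, 2)
```

Substitutions with score zero are excluded because they are not members of the full answer set.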

[0069] It is assumed that the output of a query-answering algorithm given the query Q will not be a full answer set, but rather an r-answer for Q, where r is a parameter fixed by the user. To understand the notion of an r-answer, observe that in typical situations the full answer set for WHIRL queries will be very large. For example, the full answer set for the query Q₁ given as an example above would include all pairs of company names Company1, Company2 that both contain the term “Inc.” This set might be very large. Indeed, if it is assumed that a fixed fraction $\frac{1}{k}$

[0070] of company names contain the term “Inc.”, and that R and S each contain a random selection of n company names, then one would expect the full answer set to contain $\left( \frac{n}{k} \right)^{2}$

[0071] substitutions simply due to the matches on the term “Inc.” Further, the full answer set for the join of m relations of this sort would be of size at least $\left( \frac{n}{k} \right)^{m}.$

[0072] To further illustrate this point, I computed the pairwise similarities of two lists R and S of company names with R containing 1163 names, S containing 976 names. These lists are the relations Hoovers Web 301 and Iontech 302 shown in FIG. 3. Although the intersection of R and S appears to contain only about 112 companies, over 314,000 name pairs had non-zero similarity. In this case, the number of non-zero similarities can be greatly reduced by discarding a few very frequent terms like “Inc.” However, even after this preprocessing, there are more than 19,000 non-zero pairwise similarities, which is more than 170 times the number of correct pairings. This is due to a large number of moderately frequent terms (like “American” and “Airlines”) that cannot be safely discarded. Thus, it is in general impractical to compute full answer sets for complex queries and present them to a user. This leads to the assumption of an r-answer, which advantageously simplifies the results provided in accordance with the present invention.

[0073] The scoring scheme given above for conjunctive queries can be fairly easily extended to certain more expressive languages in accordance with the present invention. Below, I consider such an extension, which corresponds to projections of unions of conjunctive queries.

[0074] A “basic WHIRL clause” is written p(X₁, . . . ,X_(k))←Q, where Q is a conjunctive WHIRL query that contains all of the X_(i)'s. A “basic WHIRL view υ” is a set of basic WHIRL clauses with heads that have the same predicate symbol p and arity k. Notice that by this definition, all the literals in a clause body are either EDB literals or similarity literals. In other words, the view is flat, involving only extensionally defined predicates.

[0075] Now, consider a ground instance a=p(x₁, . . . ,x_(k)) of the head of some view clause. The “support of a” (relative to the view υ and a given EDB) is defined to be the following set of triples:

[0076] support(a)={(A←Q,θ,s): (A←Q) ∈ υ and Aθ=a and score(Qθ)=s and s>0}. The score of (x₁, . . . ,x_(k)) in p is defined as follows: $\mathrm{score}\left( \left( x_{1},\ldots,x_{k} \right) \in p \right) = 1 - \prod\limits_{(C,\theta,s) \in \mathrm{support}(p(x_{1},\ldots,x_{k}))}\left( 1 - s \right)$ (Equation (1))

[0077] To understand this formula, note that it is in some sense a dual of multiplication: if e₁ and e₂ are independent probabilistic events with probability p₁ and p₂ respectively, then the probability of (e₁ ∧ e₂) is p₁·p₂, and the probability of (e₁ ∨ e₂) is 1−(1−p₁)(1−p₂). The “materialization of the view υ” is defined to be a relation with name p which contains all tuples (x₁, . . . ,x_(k)) such that score((x₁, . . . ,x_(k)) ∈ p)>0.
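
As a small worked illustration of Equation (1), the following Python sketch combines the scores s of the triples in a support set. The function name view_score is hypothetical, and the support set is assumed to have been computed already.

```python
from math import prod


def view_score(support_scores):
    """Equation (1): 1 minus the product of (1 - s) over the support set.

    support_scores: the scores s of the triples (C, theta, s) supporting
    one ground instance of the view head.
    """
    return 1.0 - prod(1.0 - s for s in support_scores)


# Two independent supports of 0.5 each combine like a disjunction.
combined = view_score([0.5, 0.5])
```

A tuple with an empty support set receives a score of zero under this formula, so it does not appear in the materialized view.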

[0078] Unfortunately, while this definition is natural, there is a difficulty with using it in practice. In a conventional setting, it is easy to materialize a view of this sort, given a mechanism for solving a conjunctive query. In WHIRL, one would prefer to assume only a mechanism for computing r-answers to conjunctive queries. However, since Equation (1) involves a support set of unbounded size, it appears that r-answers are not enough to even score a single ground instance a.

[0079] Fortunately, however, low-scoring substitutions have only a minimal impact on the score of a. Specifically, if (C,θ,s) is such that s is close to zero, then the corresponding factor of (1−s) in the score for a is close to one. One can thus approximate the score of Equation (1) using a smaller set of high-scoring substitutions, such as those found in an r-answer for moderately large r.

[0080] In particular, let υ contain the clauses A₁←Q₁, . . . ,A_(n)←Q_(n), let R_(Q1), . . . ,R_(Qn) be r-answers for the Q_(i)'s, and let R=∪_(i)R_(Qi). Now define the “r-support for a from R” to be the set:

{(A←Q,θ,s): (A←Q,θ,s) ∈ support(a) and θ ∈ R}

[0081] Also define the r-score for a from R by replacing support(a) in Equation (1) with the r-support set for a. Finally, define the “r-materialization of υ from R” to contain all tuples with non-zero r-score, with the score of x₁, . . . ,x_(k) in p being its r-score from R.
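
The r-materialization can be sketched in Python as follows. This is a minimal illustration that assumes each clause body has already been reduced to an r-answer and that each substitution has already been applied to the clause head, so every entry is a (head tuple, score) pair; the name r_materialize and the example tuples are hypothetical.

```python
from collections import defaultdict
from math import prod


def r_materialize(r_answers_per_clause):
    """Approximate a materialized view from one r-answer per clause.

    r_answers_per_clause: list with one entry per view clause; each entry
    is a list of (head_tuple, score) pairs taken from that clause's r-answer.
    Returns a dict mapping each head tuple to its r-score.
    """
    r_support = defaultdict(list)
    for clause_answers in r_answers_per_clause:
        for head_tuple, s in clause_answers:
            if s > 0:
                r_support[head_tuple].append(s)
    # Equation (1), restricted to the r-support of each tuple.
    return {
        head: 1.0 - prod(1.0 - s for s in scores)
        for head, scores in r_support.items()
    }


view = r_materialize([
    [(("acme",), 0.5)],                      # r-answer of the first clause
    [(("acme",), 0.5), (("zeta",), 0.2)],    # r-answer of the second clause
])
```

Because only the substitutions in the r-answers contribute, the r-score of a tuple never exceeds its score under the full support set.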

[0082] Clearly, the r-materialization of a view can be constructed using only an r-answer for each clause body involved in the view. As r is increased, the r-answers will include more and more high-scoring substitutions, and the r-materialization will become a better and better approximation to the full materialized view. Thus, given an efficient mechanism for computing r-answers for conjunctive views, one can efficiently approximate the answers to more complex queries.

[0083] One embodiment of WHIRL implements the operations of finding the r-answer to a query and the r-materialization of a view. As noted above, r-materialization of a view can be implemented easily given a routine for constructing r-answers. First, however, I will give a short overview of the main ideas used in the process.

[0084] In an embodiment of WHIRL, finding an r-answer is viewed as an optimization problem. In particular, the query processing algorithm uses a general method called A* search to find the highest-scoring r substitutions for a query. The A* search method is described in Principles of Artificial Intelligence, by Nils Nilsson, Morgan Kaufmann, 1987. Viewing query processing as search is natural, given that the goal is to find a small number of good substitutions, rather than all satisfying substitutions. The search method of one embodiment also generalizes certain techniques used in IR ranked retrieval. However, using search in query processing is unusual for database systems, which more typically use search only in optimizing a query.

[0085] To understand the use of search, consider finding an r-answer to the WHIRL query insiderTip(X) ∧ publiclyTraded(Y) ∧ X˜Y, where the relation publiclyTraded is very large, but the relation insiderTip is very small. In processing the corresponding equijoin insiderTip(X) ∧ publiclyTraded(Y) ∧ X=Y with a known database system, one would first construct a query plan.

[0086] For example, one might first find all bindings for X, and then use an index to find all values Y in the first column of publiclyTraded that are equivalent to some X. It is tempting to extend such a query plan to WHIRL, by simply changing the second step to find all values Y that are similar to some X. However, this natural extension can be quite inefficient. Imagine that insiderTip contains the vector x₁, corresponding to the document “Armadillos, Inc.” Due to the frequent occurrence of the term “Inc.”, there will be many documents Y that have non-zero similarity to x₁, and it will be expensive to retrieve all of these documents Y and compute their similarity to x₁. One way of avoiding this expense is to start by retrieving a small number of documents Y that are likely to be highly similar to x₁. In this case, one might use an index to find all Y's that contain the rare term “Armadillos.” Since “Armadillos” is rare, this step will be inexpensive, and the Y's retrieved in this step must be somewhat similar to x₁. Recall that the weight of a term depends inversely on its frequency, so rare terms have high weight, and hence these Y's will share at least one high-weight term with x₁. Conversely, any Y's not retrieved in this step must be somewhat dissimilar to x₁, since such a Y cannot share the high-weight term “Armadillos” with x₁. This suggests that if r is small, and an appropriate pruning method is used, a subtask like “find the r documents Y that are most similar to x₁” might be accomplished efficiently by the subplan of “find all Y's containing the term ‘Armadillos’.” Of course, this subplan depends on the vector x₁.

[0087] To find the Y's most similar to the document “The American Software Company” (in which every term is somewhat frequent), a very different type of subplan might be required. These observations suggest that query processing should proceed in small steps, and that these steps should be scheduled dynamically, in a manner that depends on the specific document vectors being processed.

[0088] The query processing method described below searches through a space of partial substitutions. Each substitution is a list of values that could be assigned to some, but not necessarily all, of the variables appearing in the query. For example, one state in the search space for the query given above would correspond to the substitution that maps X to x₁ and leaves Y unbound. Each state in the search space is a “partial list” of possible variable bindings. As used herein, a “partial list” (of possible variable bindings) can include bindings to all variables in the query, or bindings to some subset of those variables, including the empty set. The steps taken through this search space are small ones, as suggested by the discussion above. For instance, one operation is to select a single term t and use an inverted index to find plausible bindings for a single unbound variable. Finally, the search algorithm orders these operations dynamically, focusing on those partial substitutions that seem to be most promising, and effectively pruning partial substitutions that cannot lead to a high-scoring ground substitution.

[0089] A* search is a graph search method which attempts to find the highest scoring path between a given start state s₀ and a goal state. A pseudo-code embodiment of A* search as used in an embodiment of the present invention is as follows:

[0090] procedure A*(r, s₀, goalState(·), children(·))

[0091] begin

[0092] OPEN := {s₀}

[0093] while (OPEN ≠ ∅) do

[0094] s := argmax_(s′∈OPEN) h(s′)

[0095] OPEN := OPEN − {s}

[0096] if goalState(s) then

[0097] output <s, h(s)>

[0098] exit if r answers printed

[0099] else

[0100] OPEN := OPEN ∪ children(s)

[0101] endif

[0102] endwhile

[0103] end

[0104] Initial state s₀: <∅, ∅>

[0105] goalState(<θ, E>): true iff Qθ is ground

[0106] children(<θ, E>):

[0107] if constrain(<θ, E>) ≠ ∅ then return constrain(<θ, E>)

[0108] else return explode(<θ, E>)

[0109] constrain(<θ, E>):

[0110] 1. pick X, Y, t where

[0111] Xθ = x,

[0112] Y is unbound in θ with generator p and generation index l (see text)

[0113] x^(t) · maxweight(t, p, l) is maximal over all such X, Y, t combinations

[0114] 2. if no such X, Y, t exists then return ∅

[0115] 3. return {<θ, E′>} ∪ {<θ₁, E>, . . . , <θ_(n), E>}

[0116] where E′ = E ∪ {<t, Y>}, and

[0117] each θ_(i) is θ ∪ {Y₁=v₁, . . . , Y_(k)=v_(k)} for some <v₁, . . . , v_(k)> ∈ index(t, p, l) and

[0118] θ_(i) is E-valid.

[0119] explode(<θ, E>):

[0120] pick p(Y₁, . . . ,Y_(k)) such that all Y_(i)'s are unbound by θ

[0121] return the set of all <θ ∪ {Y₁=v₁, . . . , Y_(k)=v_(k)}, E>

[0122] such that <v₁, . . . , v_(k)> ∈ tuples(p) and θ ∪ {Y₁=v₁, . . . , Y_(k)=v_(k)} is E-valid.

[0123] h(<θ, E>): Π_(i=1)^(k) h′(B_(i)θ) where

[0124] h′(B_(i)θ) = score(B_(i)θ) for ground B_(i)θ

[0125] h′((X˜Y)θ) =

[0126] Σ_(t∈T: <t,Y>∉E) x^(t) · maxweight(t, p, l)

[0127] where Xθ = x, Y is unbound in θ with

[0128] generator p and generation index l (see text)

[0129] As can be seen in the above pseudo-code, goal states are defined by a goalState predicate. The graph being searched is defined by a function children(s), which returns the set of states directly reachable from state s. To conduct the search, the A* algorithm maintains a set OPEN of states that might lie on a path to some goal state. Initially OPEN contains only the start state s₀.

[0130] At each subsequent step of the algorithm, a single state is removed from the OPEN set; in particular, the state s that is “best” according to a heuristic function, h(s), is removed from OPEN. If s is a goal state, then this state is output; otherwise, all children of s are added to the OPEN set. The search continues until r goal states have been output, or the search space is exhausted.
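
The generic loop described above can be sketched in Python as follows. This is a minimal illustration, not the described implementation: it assumes OPEN is kept in a binary heap keyed on the negated heuristic value, and the toy search problem (choosing one value from each of two lists so as to maximize their product) is purely hypothetical.

```python
import heapq
from itertools import count


def a_star(r, s0, goal_state, children, h):
    """Output up to r goal states in order of non-increasing h value."""
    tie_breaker = count()  # avoids comparing states when h values tie
    open_heap = [(-h(s0), next(tie_breaker), s0)]
    answers = []
    while open_heap and len(answers) < r:
        neg_h, _, s = heapq.heappop(open_heap)  # state with maximal h
        if goal_state(s):
            answers.append((s, -neg_h))
        else:
            for child in children(s):
                heapq.heappush(open_heap, (-h(child), next(tie_breaker), child))
    return answers


# Toy problem: a state is a tuple of values chosen so far.
options = [[0.9, 0.5], [0.8, 0.4]]


def goal(s):
    return len(s) == len(options)


def kids(s):
    return [s + (v,) for v in options[len(s)]]


def upper_bound(s):
    """Admissible h: chosen values times the best remaining values."""
    value = 1.0
    for v in s:
        value *= v
    for remaining in options[len(s):]:
        value *= max(remaining)
    return value


best_two = a_star(2, (), goal, kids, upper_bound)
```

Because the heuristic in the toy problem never underestimates the score of any completion and equals the true score on goal states, the goal states are output in order of non-increasing score.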

[0131] I will now explain how this general search method has been instantiated in WHIRL in accordance with an embodiment of the present invention. I will assume that in the query Q, each variable in Q appears exactly once in an EDB literal. In other words, the variables in EDB literals are distinct from each other, and also distinct from variables appearing in other EDB literals, and both variables appearing in a similarity literal also appear in some EDB literal. (This restriction is made innocuous by an additional predicate eq(X,Y) which is true when X and Y are bound to the same document vector. The implementation of the eq predicate is straightforward and known in the art, and will be ignored in the discussion below.) In processing queries, the following data structures will be used. An inverted index will map terms t∈T to the tuples that contain them: specifically, I assume a function index(t,p,i) which returns the set of tuples (v₁, . . . , v_(i), . . . ,v_(k)) in tuples(p) such that v_(i)^(t)>0. This index can be evaluated in linear time (using an appropriate data structure) and precomputed in linear time from the EDB. I also precompute the function maxweight(t,p,i), which returns the maximum value of v_(i)^(t) over all documents v_(i) in the i-th column of p. Inverted indices are commonly used in the field of information retrieval, and means of storing and accessing them efficiently are well known to those skilled in the art of information retrieval. The maxweight function is also used in many known techniques for speeding up processing of ranked retrieval queries, such as those described in Turtle and Flood.
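
The two precomputed structures can be sketched in Python as follows. The sketch assumes that each document vector is represented as a dict from term to weight and that a relation is a list of tuples of such vectors; the function name build_index and the example weights are hypothetical.

```python
from collections import defaultdict


def build_index(relation, column):
    """Precompute index(t, p, i) and maxweight(t, p, i) for one column.

    relation: list of tuples; each entry of a tuple is a document vector,
              represented here as a dict {term: weight}.
    column:   zero-based position i of the column to index.
    Returns (index, maxweight), both keyed by term.
    """
    index = defaultdict(list)
    maxweight = defaultdict(float)
    for tup in relation:
        for term, weight in tup[column].items():
            if weight > 0:
                index[term].append(tup)
                maxweight[term] = max(maxweight[term], weight)
    return index, maxweight


# Hypothetical single-column relation of company names.
p = [
    ({"armadillos": 0.9, "inc": 0.1},),
    ({"acme": 0.8, "inc": 0.2},),
]
index, maxweight = build_index(p, 0)
```

A single pass over the relation fills both structures, which is consistent with the linear-time precomputation mentioned above.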

[0132] The states of the graph searched will be pairs <θ,E>, where θ is a substitution, and E is a set of exclusions. Goal states will be those for which θ is ground for Q, and the initial state s₀ is <∅,∅>. An exclusion is a pair <t,Y> where t is a term and Y is a variable. Intuitively, it means that the variable Y must not be bound to a document containing the term t. Formally, I say that a substitution θ is E-valid if ∀<t,Y>∈E, (Yθ)^(t)=0. Below I define the children function so that all descendants of a node <θ,E> must be E-valid; making appropriate use of these exclusions will force the graph defined by the children function to be a tree.

[0133] I will adopt the following terminology. Given a substitution θ and query Q, a similarity literal X˜Y is constraining if and only if exactly one of Xθ and Yθ is ground. Without loss of generality, I assume that Xθ is ground and Yθ is not. For any variable Y, the EDB literal of Q that contains Y is the generator for Y, and the position l of Y within this literal is Y's generation index. For well-formed queries, there will be only one generator for a variable Y.

[0134] Children are generated in two ways: by exploding a state, or by constraining a state. Exploding a state corresponds to picking all possible bindings of some unbound EDB literal. To explode a state s=<θ,E>, pick some EDB literal p(Y₁, . . . , Y_(k)) such that all the Y_(i)'s are unbound by θ, and then construct all states of the form <θ∪{Y₁=v₁, . . . ,Y_(k)=v_(k)},E> such that <v₁, . . . ,v_(k)> ∈ tuples(p) and θ∪{Y₁=v₁, . . . ,Y_(k)=v_(k)} is E-valid. These are the children of s.

[0135] The second operation of constraining a state implements a sort of sideways information passing. To constrain a state s=<θ,E>, pick some constraining literal X˜Y and some term t with non-zero weight in the document Xθ such that <t,Y>∉E. Let p(Y₁, . . . ,Y_(k)) be the generator for the (unbound) variable Y, and let l be Y's generation index. Two sets of child states will now be constructed. The first is a singleton set containing the state s′=<θ,E′>, where E′=E∪{<t,Y>}. Notice that by further constraining s′, other constraining literals and other terms t in Xθ can be used to generate plausible variable bindings. The second set S_(t) contains all states <θ_(i),E> such that θ_(i)=θ∪{Y₁=v₁, . . . , Y_(k)=v_(k)} for some <v₁, . . . , v_(k)> ∈ index(t,p,l) and θ_(i) is E-valid. The states in S_(t) thus correspond to binding Y to some vector containing the term t. The set children(s) is S_(t)∪{s′}.
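
The constrain operation can be sketched in Python as follows. The sketch assumes that a substitution is a dict from variable name to document vector (itself a dict from term to weight), that exclusions are a frozenset of (term, variable) pairs, and that the inverted index for the generator column is a plain dict; the names is_e_valid and constrain and all example data are hypothetical.

```python
def is_e_valid(theta, exclusions):
    """True if no bound variable holds a document containing an excluded term."""
    return all(
        theta[var].get(term, 0.0) == 0.0
        for term, var in exclusions
        if var in theta
    )


def constrain(theta, exclusions, term, y, generator_vars, index):
    """Children of the state <theta, E> for the pair <term, y>.

    generator_vars: variables of y's generator literal, in column order.
    index: dict mapping a term to the generator tuples whose y column
           contains that term, i.e. index(t, p, l).
    """
    # s' keeps theta unchanged and records that term was already tried for y.
    children = [(theta, exclusions | {(term, y)})]
    # S_t binds the generator's variables to each tuple found via the index.
    for tup in index.get(term, []):
        theta_i = {**theta, **dict(zip(generator_vars, tup))}
        if is_e_valid(theta_i, exclusions):
            children.append((theta_i, exclusions))
    return children


x1 = {"armadillos": 0.9, "inc": 0.1}
y_a = {"armadillos": 0.8, "inc": 0.2}
y_b = {"armadillos": 0.7, "corp": 0.3}
index = {"armadillos": [(y_a,), (y_b,)]}

kids = constrain({"X": x1}, frozenset(), "armadillos", "Y", ["Y"], index)
pruned = constrain({"X": x1}, frozenset({("inc", "Y")}), "armadillos", "Y", ["Y"], index)
```

In the second call the exclusion on “inc” removes the candidate y_a, which illustrates how exclusions keep the descendants of different children disjoint.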

[0136] It is easy to see that if s_(i) and s_(j) are two different states in S_(t), then their descendants must be disjoint. Furthermore, the descendants of s′ must be disjoint from the descendants of any s_(i) ∈ S_(t), since all descendants of s′ are valid for E′, and none of the descendants of s_(i) can be valid for E′. Thus the graph generated by this children function is a tree.

[0137] Given the operations above, there will typically be many ways to “constrain” or “explode” a state. In the current implementation of WHIRL, a state is always constrained using the pair <t,Y>, such that x^(t)·maxweight(t,p,l) is maximal, where p and l are the generator and generation index for Y. States are exploded only if there are no constraining literals, and then always exploded using the EDB relation containing the fewest tuples.

[0138] It remains to define the heuristic function, which, when evaluated, produces a heuristic value. Recall that the heuristic function h(θ,E) must be admissible, and must coincide with the scoring function score(Qθ) on ground substitutions. This implies that h(θ,E) must be an upper bound on score(q) for any ground instance q of Qθ. I thus define h(θ,E) to be Π_(i=1)^(k) h′(B_(i),θ,E), where h′ will be an appropriate upper bound on score(B_(i)θ). I will let this bound equal score(B_(i)θ) for ground B_(i)θ, and let it equal 1 for non-ground B_(i)θ, with the exception of constraining literals. For constraining literals, h′(*) is defined as follows: $h^{\prime}\left( B_{i},\theta,E \right) \equiv \sum\limits_{t \in T:\ \langle t,Y \rangle \notin E}{x^{t} \cdot \mathrm{maxweight}\left( t,p,l \right)}$

[0139] where p and l are the generator and generation index for Y. Note that this is an upper bound on the score of B_(i)σ relative to any ground superset σ of θ that is E-valid.
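
The bound for a constraining literal can be sketched in Python as follows. The sketch assumes the bound document x is a dict from term to weight, that maxweight for Y's generator column is a dict from term to its maximum weight, and that exclusions are a set of (term, variable) pairs; the function name and the example weights are hypothetical.

```python
def h_prime_constraining(x, y, exclusions, maxweight):
    """Upper bound on the similarity of x to any E-valid binding of y.

    Sums x^t * maxweight(t, p, l) over the terms t of x that have not
    been excluded for the variable y.
    """
    return sum(
        weight * maxweight.get(term, 0.0)
        for term, weight in x.items()
        if (term, y) not in exclusions
    )


io_doc = {"telecommunications": 0.8, "services": 0.3}
column_max = {"telecommunications": 0.9, "services": 0.5}

before = h_prime_constraining(io_doc, "Industry", set(), column_max)
after = h_prime_constraining(
    io_doc, "Industry", {("telecommunications", "Industry")}, column_max
)
```

Adding an exclusion removes a non-negative term from the sum, so the bound of the child state that carries the enlarged exclusion set is never higher than the bound of its parent.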

[0140] In the current implementation of WHIRL, the terms of a document are stems produced by the Porter stemming algorithm. The Porter stemming algorithm is described in “An Algorithm for Suffix Stripping”, by M. F. Porter, Program, 14(3):130-137, 1980. In general, the term weights for a document v_(i) are computed relative to the collection C of all documents appearing in the i-th column of p. However, the TF-IDF weighting scheme does not provide sensible weights for relations that contain only a single tuple. (These relations are used as a means of introducing “constant” documents into a query.) Therefore weights for these relations must be calculated as if they belonged to some other collection C′.

[0141] To set these weights, every query is checked before invoking the query algorithm to see if it contains any EDB literals p(X₁, . . . ,X_(k)) for a singleton relation p. If one is found, the weights for the document x_(i) to which a variable X_(i) will be bound are computed using the collection of documents found in the column corresponding to Y_(i), where Y_(i) is some variable that appears in a similarity literal with X_(i). If several such Y_(i)'s are found, one is chosen arbitrarily. If X_(i) does not appear in any similarity literals, then its weights are irrelevant to the computation.

[0142] The current implementation of WHIRL keeps all indices and document vectors in main memory.

[0143] In the following examples of the procedure in accordance with the present invention, it is assumed that terms are words.

[0144] Consider the query “const1(IO) ∧ p(Company,Industry) ∧ Industry˜IO”, where const1 contains the single document “telecommunications services and/or equipment”. With θ=∅, there are no constraining literals, so the first step in answering this query will be to explode the smallest relation, in this case const1. This will produce one child, s₁, containing the appropriate binding for IO, which will be placed on the OPEN list.

[0145] Next s₁ will be removed from the OPEN list. Since Industry˜IO is now a constraining literal, a term from the bound variable IO will be picked, probably the relatively rare stem “telecommunications”. The inverted index will be used to find all tuples <co₁,ind₁>, . . . , <co_(n),ind_(n)> such that ind_(i) contains the term “telecommunications”, and n child substitutions that map Company=co_(i) and Industry=ind_(i) will be constructed. Since these substitutions are ground, they will be given h(*) values equal to their actual scores when placed on the OPEN list. A new state s′₁ containing the exclusion <telecommunications,Industry> will also be placed on the OPEN list. Note that h(s′₁)<h(s₁), since in s′₁ the constraining literal Industry˜IO can match at most four terms: “services”, “and”, “or”, “equipment”, all of which are relatively frequent, and hence have low weight.

[0146] Next, a state will again be removed from the OPEN list. It may be that h(s′₁) is less than the h(*) value of the best goal state; in this case, a ground substitution will be removed from OPEN, and an answer will be output. Or it may be that h(s′₁) is higher than that of the best goal state, in which case s′₁ will be removed and a new term, perhaps “equipment”, will be used to generate some additional ground substitutions. These will be added to the OPEN list, along with a state which has a larger exclusion set and thus a lower h(*) value.

[0147] This process will continue until r documents are generated. Note that it is quite likely that low-weight terms such as “or” will not be used at all.

[0148] In another example of the present invention, consider the query

p(Company1,Industry) ∧ q(Company2,WebSite) ∧ Company1˜Company2

[0149] In solving this query, the first step will be to explode the smaller of these relations. Assume that this is p, and that p contains 1000 tuples. This will add 1000 states s₁, . . . ,s₁₀₀₀ to the OPEN list. In each of these states, Company1 and Industry are bound, and Company1˜Company2 is a constraining literal. Thus each of these 1000 states is analogous to the state s₁ in the preceding example.

[0150] However, the h(*) values for the states s₁, . . . ,s₁₀₀₀ will not be equal. The value of the state s_(i) associated with the substitution θ_(i) will depend on the maximum possible score for the literal Company1˜Company2, and this will be large only if the high-weight terms in the document Company1θ_(i) appear in the company field of q. As an example, a one-word document like “3Com” will have a high h(*) value if that term appears (infrequently) in the company field of q, and a zero h(*) value if it does not appear; similarly, a document like “Agents, Inc” will have a low h(*) value if the term “agents” does not appear in the first column of q.

[0151] The result is that the next step of the algorithm will be to choose a promising state s_(i) from the OPEN list, a state that could result in a good final score. A term from the Company1 document in s_(i), e.g., “3Com”, will then be picked and used to generate bindings for Company2 and WebSite. If any of these bindings results in a perfect match, then an answer can be generated on the next iteration of the algorithm.

[0152] In short, the operation of WHIRL is somewhat similar to time-sharing 1000 simpler queries on a machine for which the basic unit of computation is to access a single inverted index. However, WHIRL's use of the h(*) function will schedule the computation of these queries in an intelligent way: queries unlikely to produce good answers can be discarded, and low-weight terms are unlikely to be used.

[0153] In yet another example, consider the query

p(Company1,Industry) ∧ q(Company2,WebSite) ∧ Company1˜Company2 ∧ const1(IO) ∧ Industry˜IO,

[0154] where the relation const1 contains the single document “telecommunications and/or equipment”. In solving this query, WHIRL will first explode const1 and generate a binding for IO. The literal Industry˜IO then becomes constraining, so it will be used to pick bindings for Company1 and Industry using some high-weight term, perhaps “telecommunications”.

[0155] At this point there will be two types of states on the OPEN list. There will be one state s′ in which only IO is bound, and <telecommunications,Industry> is excluded. There will also be several states s₁, . . . ,s_(n) in which IO, Company1 and Industry are bound; in these states, the literal Company1˜Company2 is constraining. If s′ has a higher h(*) value than any s_(i), then s′ will be removed from the OPEN list, and another term from the literal Industry˜IO will be used to generate additional variable bindings.

[0156] However, if some s_(i) has a high h(*) value, then it will be taken ahead of s′. Note that this is possible when the bindings in s_(i) lead to a good actual similarity score for Industry˜IO as well as a good potential similarity score for Company1˜Company2 (as measured by the h′(*) function). If an s_(i) is picked, then bindings for Company2 and WebSite will be produced, resulting in a ground state. This ground state will be removed from the OPEN list on the next iteration only if its h(*) value is higher than that of s′ and all of the remaining s_(i).

[0157] This example illustrates how bindings can be propagated through similarity literals. The binding for IO is first used to generate bindings for Company1 and Industry, and then the binding for Company1 is used to bind Company2 and WebSite. Note that bindings are generated using high-weight, low-frequency terms first, and low-weight, high-frequency terms only when necessary.

[0158] Embodiments of the invention have been evaluated on data collected from a number of sites on the World Wide Web. I have evaluated the run-time performance with CPU time measurements on a specific class of queries, which I will henceforth call similarity joins. A similarity join is a query of the form p(X₁, . . . ,X_(i), . . . ,X_(k)) ∧ q(Y₁, . . . ,Y_(j), . . . ,Y_(b)) ∧ X_(i)˜Y_(j)

[0159] An answer to this query will consist of the r tuples from p and q such that X_(i) and Y_(j) are most similar. WHIRL was compared on queries of this sort to the following known algorithms:

[0160] 1) The naive method for similarity joins takes each document in the i-th column of relation p in turn, and submits it as an IR ranked retrieval query to a corpus corresponding to the j-th column of relation q. The top r results from each of these IR queries are then merged to find the best r pairs overall. This might more appropriately be called a “semi-naive” method; on each IR query, I use inverted indices, but I employ no special query optimizations.

[0161] 2) WHIRL is closely related to the maxscore optimization, which is described in Query Evaluation: Strategies and Optimizations by Howard Turtle and James Flood, in Information Processing and Management, 31(6):831-850, November 1995. WHIRL was compared to a maxscore method for similarity joins; this method is analogous to the naive method described above, except that the maxscore optimization is used in finding the best r results from each “primitive” query.
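
The overall shape of the naive baseline, one ranked retrieval per document of p followed by a merge of the per-query top-r lists, can be sketched in Python as follows. This is a simplified illustration only: it scans every document of q instead of using inverted indices, and it assumes that similarity is the sum over shared terms of the products of term weights, with documents represented as dicts from term to weight. The names sim and naive_similarity_join are hypothetical.

```python
import heapq


def sim(x, y):
    """Assumed similarity: sum of weight products over shared terms."""
    return sum(weight * y.get(term, 0.0) for term, weight in x.items())


def naive_similarity_join(p_docs, q_docs, r):
    """Return the r best (score, i, j) triples, best first.

    Each document of p is treated as one ranked retrieval query against
    q; the top r results of every query are then merged.
    """
    merged = []
    for i, x in enumerate(p_docs):
        scored = [(sim(x, y), i, j) for j, y in enumerate(q_docs)]
        merged.extend(heapq.nlargest(r, scored))
    return heapq.nlargest(r, merged)


p_docs = [{"a": 1.0}, {"b": 1.0}]
q_docs = [{"a": 1.0}, {"b": 0.5}]
top_pairs = naive_similarity_join(p_docs, q_docs, 2)
```

The cost of this sketch grows with the product of the two relation sizes, which is the behavior that the maxscore optimization and the search-based method are intended to avoid.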

[0162] I computed the top 10 answers for the similarity join of subsets of the IMDB 303 and VideoFlicks 304 relations shown in FIG. 3. In particular, I joined size n subsets of both relations, for various values of n between 2000 and 30,000. WHIRL speeds up the maxscore method by a factor of between 4 and 9, and speeds up the naive method by a factor of 20 or more. The absolute time required to compute the join is fairly modest. With n=30,000, WHIRL takes well under a minute to pick the best 10 answers from the 900 million possible candidates.

[0163] To evaluate the accuracy of the answers produced by WHIRL, I adopted the following methodology. Again focusing on similarity joins, I selected pairs of relations which contained two or more plausible “key” fields. One of these fields, the “primary key”, was used in the similarity literal in the join. The second key field was then used to check the correctness of proposed pairings; specifically, a pairing was marked as “correct” if the secondary keys matched (using an appropriate matching procedure) and “incorrect” otherwise.

[0164] I then treated “correct” pairings in the same way that “relevant” documents are typically treated in evaluation of a ranking proposed by a standard IR system. In particular, I measured the quality of a ranking using non-interpolated average precision. To motivate this measurement, assume the end user will scan down the list of answers and stop at some particular target answer that he or she finds to be of interest. The answers listed below this “target” are not relevant, since they are not examined by the user. Above the target, one would like to have a high density of correct pairings; specifically, one would like the set S of answers above the target to have high precision, where the precision of S is the ratio of the number of correct answers in S to the number of total answers in S. Average precision is the average precision for all “plausible” target answers, where an answer is considered a plausible target only if it is correct. To summarize, letting a_(k) be the number of correct answers in the first k, and letting c(k)=1 iff the k-th answer is correct and letting c(k)=0 otherwise, average precision is the quantity $\sum\limits_{k = 1}^{r}{{c(k)} \cdot {\frac{a_{k}}{k}.}}$
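
The measurement can be sketched in Python as follows. One assumption is made explicit here: following the description of the measure as an average over the plausible (correct) target answers, the sum of c(k)·a_(k)/k is divided by the number of correct answers in the ranking. The function name average_precision is hypothetical.

```python
def average_precision(correct_flags):
    """Non-interpolated average precision of a ranked answer list.

    correct_flags: booleans in rank order; True marks a correct pairing.
    Assumption: the sum of c(k) * a_k / k is divided by the number of
    correct answers, so that the result is an average.
    """
    correct_so_far = 0   # a_k
    precision_sum = 0.0
    for k, is_correct in enumerate(correct_flags, start=1):
        if is_correct:   # c(k) = 1
            correct_so_far += 1
            precision_sum += correct_so_far / k
    return precision_sum / correct_so_far if correct_so_far else 0.0


ranking = [True, False, True]
ap = average_precision(ranking)
```

For the ranking above, the precision at the first correct answer is 1/1 and at the second is 2/3, so the average over the two plausible targets is 5/6.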

[0165] I used three pairs of relations from three different domains. In the business domain, I joined Iontech 301 and Hoovers Web 302, using company name as the primary key, and the string representing the “site” portion of the home page as a secondary key. In the movie domain, I joined Review 305 and MovieLink 306 (FIG. 3), using film names as a primary key. As a secondary key, I used a special key constructed by the hand-coded normalization procedure for film names that is used in IM, an implemented heterogeneous data integration system described in Querying Heterogeneous Information Sources Using Source Descriptions by Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille, Proceedings of the 22nd International Conference on Very Large Databases (VLDB-96), Bombay, India, September 1996. In the animal domain, I joined Animal1 307 and Animal2 308 (FIG. 3), using common names as the primary key, and scientific names as a secondary key (and a hand-coded domain-specific matching procedure).

[0166] On these domains, similarity joins are extremely accurate. In the movie domain, the performance is actually identical to the hand-coded normalization procedure, and thus has an average precision of 100%. In the animal domain, the average precision is 92.1%, and in the business domain, average precision is 84.6%. These results contrast with the typical performance of statistical IR systems on retrieval problems, where the average precision of a state-of-the-art IR system is usually closer to 50% than 90%. In other words, the tested embodiment of the present invention was able to achieve results in an efficient, automatic fashion that were just as good as the results obtained using a substantially more expensive technique involving hand-coding, i.e., human intervention.

[0167] The foregoing has disclosed to those skilled in the arts of information retrieval and databases how to integrate information from many heterogeneous sources using the method of the invention. While the techniques disclosed herein are the best presently known to the inventor, other techniques could be employed without departing from the spirit and scope of the invention. For example, representations other than relational representations are used to store data; some of these representations are described in Proceedings of the Workshop on Management of Semistructured Data, edited by Dan Suciu, available from http://www.research.att.com/˜suciu/workshop-papers.html. Many of these representations also employ constant values as keys, and could be naturally extended to use instead textual values that are associated with each other based on similarity metrics.

[0168] In the process of finding answers with high score, the invention employs A* search. Many variants of this search algorithm are known and many of these could be used. The current invention also outputs answer tuples in an order that is strictly dictated by score; some variants of A* search are known that require less compute time, but output answers in an order that is largely, but not completely, consistent with this ordering.
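
The strict score ordering described above can be illustrated with a minimal best-first sketch in Python. It is not the implementation of the invention: it assumes a two-relation similarity join in which each tuple is already a dictionary of non-negative term weights, and the names top_k_pairs and dot are hypothetical. A partial list holding one tuple is ranked by an optimistic upper bound on the score of any completion, so a complete pair is only emitted once no remaining candidate can beat it.

```python
import heapq
from itertools import count


def dot(u, v):
    """Inner product of two sparse term-weight vectors (dicts)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def top_k_pairs(rel_a, rel_b, k):
    """Return up to k (i, j) pairs with their scores, best score first.

    Assumption for this sketch: all weights are non-negative, which is
    what makes the bound below an upper bound.
    """
    # Largest weight of each term anywhere in rel_b.
    max_w = {}
    for v in rel_b:
        for t, w in v.items():
            max_w[t] = max(max_w.get(t, 0.0), w)

    tie = count()  # tie-breaker so the heap never compares states
    heap = []
    for i, u in enumerate(rel_a):
        # Optimistic bound on dot(u, v) for every v in rel_b.
        bound = sum(w * max_w.get(t, 0.0) for t, w in u.items())
        heapq.heappush(heap, (-bound, next(tie), (i,)))

    answers = []
    while heap and len(answers) < k:
        neg_value, _, state = heapq.heappop(heap)
        if len(state) == 2:
            # Complete list: nothing left in the heap can score higher.
            answers.append((state, -neg_value))
            continue
        # Partial list: extend it with each tuple of rel_b that overlaps.
        i = state[0]
        for j, v in enumerate(rel_b):
            score = dot(rel_a[i], v)
            if score > 0:
                heapq.heappush(heap, (-score, next(tie), (i, j)))
    return answers
```

For simplicity the sketch extends a partial list by scanning every tuple of the second relation; an inverted index from terms to tuples would avoid that scan.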

[0169] Methods are also known for finding pairs of similar keys by using Monte Carlo sampling methods; these methods are described in Approximating Matrix Multiplication for Pattern Recognition Tasks, in Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 682-691, 1997. For certain types of queries, these sampling methods could be used instead of, or as a supplement to, some variant of A* search.

[0170] Many different term-based similarity functions have been proposed by researchers in information retrieval. Many of these variants could be employed instead of the function employed in the invention.
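
As one illustration of a term-based similarity function of this kind, the following Python sketch weights each term by a TF-IDF style formula, normalizes each vector to unit length, and scores two fields by the sum of the products of the weights of their shared terms. The particular weighting, log(TF + 1) times log(N / DF), is an assumption chosen for the sketch and may differ from the function used in the invention; the names tfidf_vectors and similarity and the whitespace tokenization are likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each text field into a unit-length sparse term-weight vector."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of fields in which each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Rare terms (low document frequency) receive higher weight.
        vec = {t: math.log(c + 1) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def similarity(u, v):
    """Sum of products of the weights of terms shared by u and v."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

With unit-length vectors and non-negative weights the score lies between 0 and 1. For example, among the fields "Acme Corp", "Acme Corporation", and "Globex Inc", the first two share the term "acme" and so receive a positive score, while the first and third share no terms and score 0.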

[0171] Finally, while the problem that motivated the development of this invention is integration of data from heterogeneous databases, there are potentially other problems to which the present invention can be advantageously applied. That being the case, the description of the present invention set forth herein is to be understood as being in all respects illustrative and exemplary, but not restrictive.

What is claimed is:
 1. A method for answering queries concerning information stored in a set of collections, where each collection includes a structured entity, and where each structured entity includes a field, comprising the steps of: a. receiving a query that specifies i. a subset of the set of collections; ii. a logical constraint between fields that includes a requirement that a first field match a second field; b. automatically determining the probability that the first field matches the second field based upon the contents of the fields; and c. generating a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfies the logical constraint specified in the query.
 2. The method of claim 1, wherein members of the set of collections are derived from a plurality of distinct sources.
 3. The method of claim 1, wherein a collection of structured entities is a relation, and wherein a structured entity is a tuple.
 4. The method of claim 1, wherein the first field and the second field include a group of terms.
 5. The method of claim 4, wherein a term corresponds to at least one of the following: a word, a word prefix, a word suffix, and a phrase.
 6. The method of claim 4, wherein the group of terms refers to an external entity.
 7. The method of claim 4, wherein the group of terms is represented by a vector, where each component of the vector corresponds to one of the terms of a set of terms that can possibly occur in the group, and where each component is assigned a value corresponding to a weight of the term of the component.
 8. The method of claim 7, further comprising the step of obtaining a value representing the similarity of a first vector to a second vector.
 9. The method of claim 8, wherein obtaining a value representing the similarity of the first vector to the second vector comprises the steps of computing the sum of the product of the weight of each first vector component with the weight of each second vector component that represents the same term as the first vector component.
 10. The method of claim 9, further comprising the step of using the similarity value to determine the probability that the first vector matches the second vector.
 11. The method of claim 7, wherein the weight assigned to a component corresponding to a term is higher if the term is rare in the set of collections of structured entities.
 12. The method of claim 1, wherein the set of lists includes substantially all of a response set of K possible lists that are estimated to have the highest probability that the members of each list satisfies the logical constraint specified in the query, where K is a parameter supplied by the user.
 13. The method of claim 1, further comprising the step of searching through a space of partial lists to find the lists that belong to the response set.
 14. The method of claim 13, wherein searching through a space of partial lists comprises the steps of: i. choosing a partial list with an extreme heuristic value; ii. determining if the partial list is complete; iii. if the partial list is complete, then presenting the partial list to the user as the answer to the query; iv. if the partial list is not complete, then extending the partial list by adding a member of the set of collections specified in the query to the partial list; v. assessing the heuristic value of the extended partial list; and vi. repeating steps i. through iii. until at least K lists have been presented to the user, where K is a parameter supplied by the user.
 15. The method of claim 14, wherein a partial list is determined to be complete if it includes a member of every collection of structured entities specified in the query.
 16. The method of claim 14, wherein the heuristic value for a partial list is at least approximately equal to the upper bound of the estimated probability that any possible extension of the partial list satisfies the logical constraint specified in the query.
 17. The method of claim 14, wherein adding a new potential member to an existing partial list comprises the steps of selecting a member of the set of collections specified in the query, and adding the selected member to the existing partial list.
 18. The method of claim 14, wherein adding a new member list to an existing partial list comprises the steps of: i. selecting a logical constraint from the query that a first field match a second field, where a member of the set of collections specified in the query corresponding to the first field is included in the partial list; ii. selecting a term that is included in the member of the partial list that corresponds to the first field; iii. finding a potential member that includes the selected term; and iv. adding the potential member that includes the selected term to the existing partial list.
 19. An apparatus for answering queries concerning information stored in a set of collections, where each collection includes a structured entity, and where each structured entity includes a field, comprising: a. a processor; b. a memory that stores search instructions adapted to be executed by said processor to receive a query that specifies a subset of the set of collections and a logical constraint between fields that includes a requirement that a first field match a second field, automatically determine the probability that the first field matches the second field based upon the contents of the fields, and generate a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfies the logical constraint specified in the query, said memory coupled to said processor.
 20. The apparatus of claim 19, further comprising a port adapted to be coupled to a network, said port coupled to said processor and said memory.
 21. The apparatus of claim 19, wherein said search instructions are further adapted to be executed by said processor to choose a partial list with an extreme heuristic value, determine if the partial list is complete, if the partial list is complete, then to present the partial list to the user as the answer to the query, and if the partial list is not complete, then to extend the partial list by adding a member of the set of collections specified in the query to the partial list, to assess the heuristic value of the extended partial list, and to continue to search through a space of partial lists until at least K lists have been presented to the user, where K is a parameter supplied by the user.
 22. A medium that stores instructions adapted to be executed by a processor to: a. receive a query that specifies i. a subset of the set of collections; ii. a logical constraint between fields that includes a requirement that a first field match a second field; b. automatically determine the probability that the first field matches the second field based upon the contents of the fields; and c. generate a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfies the logical constraint specified in the query.
 23. A medium that stores instructions adapted to be executed by a processor to: i. choose a partial list with an extreme heuristic value; ii. determine if the partial list is complete; iii. if the partial list is complete, then present the partial list to the user as the answer to the query; iv. if the partial list is not complete, then extend the partial list by adding a member of the set of collections specified in the query to the partial list; v. assess the heuristic value of the extended partial list; and vi. repeat steps i. through iii. until at least K lists have been presented to the user, where K is a parameter supplied by the user.
 24. A system for answering queries concerning information stored in a set of collections, where each collection includes a structured entity, and where each structured entity includes a field, comprising: a. means for receiving a query that specifies i. a subset of the set of collections; ii. a logical constraint between fields that includes a requirement that a first field match a second field; b. means for automatically determining the probability that the first field matches the second field based upon the contents of the fields; and c. means for generating a collection of lists in response to the query, where each list includes members of the subset of collections specified in the query, and where each list has an estimate of the probability that the members of the list satisfies the logical constraint specified in the query.
 25. A system for searching through a space of partial lists, comprising: i. means for choosing a partial list with an extreme heuristic value; ii. means for determining if the partial list is complete; iii. means for if the partial list is complete, then presenting the partial list to the user as the answer to the query; iv. means for determining if the partial list is complete; v. means for extending the partial list by adding a member of the set of collections specified in the query to the partial list; v. means for assessing the heuristic value of the extended partial list; and vi. means for determining if at least K lists have been presented to the user, where K is a parameter supplied by the user.